Statistical modelling in R for ecologists (Part 2)
Made for
Bush Heritage
Date
March 28, 2024
Here we build upon modelling skills developed in Part 1 of this course. In Part 1, we learned how to fit models to data using Frequentist statistics.
Here we will fit the same models in a Bayesian paradigm. Then we will build more complex, multivariate generalised linear models to demonstrate when they might be preferred over Frequentist approaches.
We’ll also compare multivariate generalised linear models (i.e., model-based ordination) to other algorithmic approaches typically used for analysing ecological community data (e.g., nMDS).
Finally, we’ll show how these techniques can be used to model joint species distributions, and how R can handle spatial data.
Download the models you’ll need for today from here
Diving into the fundamental and nuanced differences between Frequentist and Bayesian statistics is beyond the scope of this short course. But you can find some nice explanations of the key differences here and here.
From my point of view, both are useful. Often, I start by fitting Frequentist models to my data and then move to more complex Bayesian models if needed.
Why?
Bayesian models require more ‘number crunching’ to estimate parameters (coefficients), and therefore can take a long time to fit compared to their Frequentist analogues.
Bayesian models require us to define ‘priors’ for our model parameters. If we don’t have much prior information on model parameters (i.e., our priors are ‘uninformative’, which is usually the default), the coefficients estimated by Bayesian models should be the same as their Frequentist analogues.
Sometimes though, we need to go Bayesian. For example, if we want to model spatio-temporal autocorrelation in our data, we might like to use Bayesian models in the R-INLA package.
Another advantage of Bayesian models is that our parameters are treated as random variables rather than fixed. In practical terms this means that, rather than a single point estimate for model coefficients (+/- confidence intervals), we get an entire probability distribution of plausible values for each coefficient (referred to as the posterior distribution). This provides us with a richer understanding of our coefficients and therefore of the ecological relationships we are trying to estimate.
Here, we’ll go Bayesian because our ultimate goal is to fit a complex multivariate generalised linear model and account for spatial autocorrelation in our observations by including a spatial random effect.
Before we do though, we need to learn a little bit more about how Bayesian models estimate parameters.
Bayesian vs. Frequentist parameter estimation
In Part 1 of this course we touched on how model parameters (coefficients) are estimated with either ordinary least squares (general linear models) or maximum likelihood (generalised linear models).
The Bayesian models we’ll use today use Markov Chain Monte Carlo (MCMC) sampling to estimate the posterior distribution for each parameter (coefficient).
We don’t have time to discuss the details today, but the key things you need to know when fitting a Bayesian model with MCMC sampling are:
MCMC sampling is an algorithm that results in a chain of samples that form a probability distribution which, in our case, is the posterior distribution for each of our model parameters. Read more here
We have to set the following MCMC sampling parameters:
the number of chains (typically 4),
how many samples of the posterior to obtain in each chain (typically 1000),
the thinning rate (how often do we keep samples in each chain, e.g., a thinning rate of 1 means we keep every sample, a rate of 5 means we keep every 5th sample),
the length of the transient (also called ‘burn-in’) - this is how many initial samples we discard from each chain.
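Putting those settings together: each chain must run for transient + samples * thin iterations in total. A quick sanity check with the (hypothetical) settings we end up using later in these notes:

```r
samples   <- 1000  # posterior samples to keep per chain
thin      <- 10    # keep every 10th sample
transient <- 2500  # initial samples discarded ('burn-in')

# total MCMC iterations each chain must run
total_iterations <- transient + samples * thin
total_iterations  # 12500
```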
After we set the MCMC sampling parameters and fit the model, we need to check that the MCMC sampling has converged. This means that, in addition to checking traditional structural model assumptions for linear models like we learned in Part 1 (i.e., normality and homoskedasticity of residuals), we also need to check MCMC convergence diagnostics before making any inference or predictions from our model.
Okay, now we’re ready to try fitting a Bayesian model.
General(ised) linear models in a Bayesian framework
In Part 1, we fitted general and generalised linear models in a Frequentist framework. (Remember, general linear models assume a normal error structure (think tree circumference as our response variable \(y\)), whereas generalised linear models can accommodate non-normal error structures (think species counts or presence-absence as the response \(y\))).
Let’s try fitting the same models in a Bayesian framework using MCMC sampling for parameter (coefficient) estimation. We’ll use weak, uninformative priors (the default), meaning our coefficient estimates from the Bayesian model should be the same as the Frequentist.
In the interest of time, we’ll just repeat the generalised linear model of species abundance in relation to fire severity that we did in Part 1. But if you’re keen, feel free to try the general linear model of tree circumference (you’ll just need to specify a normal (gaussian) distribution instead of a poisson).
First, let’s simulate the data we need for species abundance and fire severity (repeating exactly as we did in Part 1).
library(ggplot2)

set.seed(123) # set seed for random number generator to ensure results are reproducible

b0 <- log(10) # average species abundance when fire severity = 0 (i.e., intercept) is 10
b1 <- -log(1/0.5) # species abundance decreases by a factor of 2 (aka 50% decrease in abundance) for a one unit increase in fire severity
n <- 100 # number of observations
X <- runif(n, min = 0, max = 1) # fire severity measurements
lambda <- exp(b0 + b1*X) # estimate mean species abundance (lambda) on multiplicative scale using the exponent
y <- rpois(n, lambda) # simulate species abundance observation from each lambda
df <- data.frame(Fire_severity = X, Species_abundance = y) # make a dataframe of observations

ggplot(df) +
  aes(x = Fire_severity, y = Species_abundance) +
  geom_point() +
  ylab('Species abundance') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  #geom_smooth(method = 'loess') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Fitting a Bayesian GLM and checking MCMC convergence
Now we can fit the model. We’ll use {Hmsc} to do this - a flexible Bayesian modelling framework for fitting complex hierarchical models with spatial or non-spatial random effects. ‘Hmsc’ is an acronym for ‘Hierarchical modelling of species communities’. We can use it to fit univariate models (as we are right now) and multivariate models (as we’ll do later).
Let’s load the package and fit a Bayesian generalised linear model where our response \(y\) is species abundance and our explanatory variable \(X\) is fire severity.
library(Hmsc)
Loading required package: coda
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated
# set up the model
m <- Hmsc(Y = as.matrix(df$Species_abundance),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson')

# fit the model with MCMC sampling
m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 1, transient = 2500, verbose = F)
setting updater$GammaEta=FALSE due to absence of random effects included to the model
So, as you can see, that takes a bit longer than our Frequentist GLMs. Now we want to check whether the MCMC sampling has converged.
First, we’ll extract the posterior distributions for the coefficients estimated by our model (i.e., the y-intercept and the beta coefficients).
mpost <-convertToCodaObject(m_fit)
Now we can look at trace plots of the MCMC sampling for each of the posterior distributions to see whether they have converged. We expect to see random mixing of the four MCMC chains (a fuzzy caterpillar), and bell-shaped, unimodal posterior distributions.
plot(mpost$Beta)
Not ideal. The chains are not mixing very well, and the posterior distributions are not bell-shaped (they're a bit too pointy, more like a witch's hat).
We can also get some quantitative convergence diagnostics, such as the effective sample size of the MCMC chains, to help us diagnose the issue. We would expect the effective sample size to be close to 4000 (1000 samples/chain * 4 chains). If it is less, that indicates there is auto-correlation in the MCMC sampling.
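The effective sample size can be checked with the effectiveSize() function from {coda} (loaded alongside {Hmsc}); a minimal sketch, assuming the mpost object created above:

```r
library(coda)

# effective sample size for each beta coefficient's posterior;
# values far below 4000 indicate autocorrelated MCMC samples
effectiveSize(mpost$Beta)
```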
Whoa, much less - we have an issue with autocorrelation in the sampling.
We can also check the Gelman-Rubin potential scale reduction factors (which should be close to 1, indicating the individual chains give similar results).
gelman.diag(mpost$Beta,multivariate=FALSE)$psrf
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.009974 1.029671
B[Fire_severity (C2), sp1 (S1)] 1.013335 1.034058
Higher than 1. This confirms what we saw above in the trace plots - the chains do not mix well.
Let’s try increasing the thinning rate to improve convergence of the MCMC sampling. This will mean the model takes a lot longer to fit - it’s doing a lot more sampling because it only keeps every 10th sample to get a total of 1000 samples for each chain.
So I’m going to ask it to run the MCMC chains in ‘parallel’ to speed it up using the argument nParallel. (Note, in the remainder of the notes the code for fitting models is commented out with a # because it takes a while to fit them (try running them on your own time). So instead we provide the fitted models as RDS files that can be read in with the readRDS function - see below).
# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 10, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_10thinning.rds')
m_fit <- readRDS('model_10thinning.rds')

# extract posterior distributions for beta coefficients
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Looks better, what about our quantitative diagnostics?
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.004772 1.016165
B[Fire_severity (C2), sp1 (S1)] 1.003701 1.012686
Improved - the psrf indicate the chains are providing similar results (as we can see in the trace), but the effective sample size is still low. Let’s bump up the thinning a bit more.
# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning.rds')
m_fit <- readRDS('model_100thinning.rds')

# extract posterior distributions for beta coefficients
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Excellent, even better than before. What do the quantitative diagnostics say now?
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.0003794 1.003090
B[Fire_severity (C2), sp1 (S1)] 0.9997679 1.001205
Much better. We’re happy now that our MCMC sampling has converged, meaning that, as long as our structural model assumptions are met, we can make inference from this model. Note, however, that sometimes you might notice that earlier samples in your trace plots look different from later samples. In this case you can try increasing the transient (i.e. the initial samples that are discarded).
Checking structural model assumptions
Now we want to check the same model assumptions we have when fitting Frequentist GLMs: the model residuals are normally distributed and have homogeneity of variance.
Unfortunately our handy plot function we used in Part 1 won’t work for {Hmsc} model objects. So we have to calculate the model residuals ourselves and plot them.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = 1) # get the mean predicted species abundance across the 4000 MCMC samples
Residuals <- scale(m_fit$Y - Fitted.mean) # calculate standardised residuals
par(mfrow = c(1,2)) # set graphical parameters
plot(Fitted.mean, Residuals); abline(a = 0, b = 0) # plot the fitted vs. residuals
qqnorm(Residuals); qqline(Residuals) # qq plot
Since we are considering count data where high dispersion can mean the errors are not normally distributed, we can also try randomised quantile residuals, which we would expect to be normally distributed even if the data is highly dispersed (but not overdispersed).
a <- ppois(m_fit$Y - 1, Fitted.mean)
b <- ppois(m_fit$Y, Fitted.mean)
qresids <- qnorm(runif(n = length(Fitted.mean), min = a, max = b))
par(mfrow = c(1,2)) # set graphical parameters
plot(Fitted.mean, qresids); abline(a = 0, b = 0) # plot the fitted vs. residuals
qqnorm(qresids); qqline(qresids) # qq plot
Both look about the same (so high dispersion in the count data isn’t an issue for interpreting the residuals). And looks very similar to the diagnostic plots we made in Part 1, where we fitted a Frequentist GLM to the same data.
Now let’s check the coefficient estimates are close to our known truths.
summary(mpost$Beta)
Iterations = 2600:102500
Thinning interval = 100
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
B[(Intercept) (C1), sp1 (S1)] 2.3526 0.07106 0.001124 0.001123
B[Fire_severity (C2), sp1 (S1)] -0.7703 0.13709 0.002168 0.002197
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
B[(Intercept) (C1), sp1 (S1)] 2.213 2.305 2.3532 2.399 2.491
B[Fire_severity (C2), sp1 (S1)] -1.038 -0.862 -0.7712 -0.676 -0.506
Remember, these estimates are on the log scale, so we need to back-transform them to the natural scale (the inverse of the natural log is the exponential function). But, as you can see, they are almost exactly the same as our Frequentist model from Part 1!
exp(summary(mpost$Beta)$statistics[1]) # back transform the y-intercept (mean species abundance when fire severity = 0)
[1] 10.51245
exp(summary(mpost$Beta)$statistics[2]) # back transform the slope
[1] 0.4628578
Multivariate Generalised linear model
Okay, so now we know how to fit a Bayesian GLM, and we can see that with default, uninformative priors it produces the same coefficient estimates as its Frequentist analogue. It took longer to fit, though, and there were more diagnostics to check (MCMC convergence in addition to standard structural model assumptions). So why go Bayesian?
Well, in the previous example, we might opt to stick with a Frequentist GLM in the interest of saving time. But, what if we had observations of multiple species’ abundances that we wanted to model together? We can build upon our Bayesian model to do so, and include a site-level random effect to estimate residual correlations in species’ abundances (i.e., the correlations in species’ abundances left unexplained after accounting for variability explained by explanatory variables (e.g., fire severity)).
You might be thinking, this sounds like a constrained ordination. You would be right! Fitting multivariate GLMs is analogous to fitting a constrained ordination. The difference is, the former is a model-based approach to ordination (which means we can do things like account for lack of independence in our samples due to spatial autocorrelation (we’ll do that later)), and the latter is an algorithmic approach to ordination.
For interest, we’ll fit an algorithmic unconstrained ordination (nMDS) to a multivariate dataset of species’ abundances and compare this to a multivariate GLM.
Let’s start by simulating a new dataset of species abundance data for multiple species. We’ll also make this a spatial example with longitude (x) and latitude (y) coordinates for each location where species are surveyed.
library(MASS)

set.seed(123) # set seed for random number generator to ensure results are reproducible

n <- 100 # number of observations
ns <- 5 # number of species

# simulate relationships between species abundance and fire severity
b0 <- log(rep(10, ns)) # y-intercept for each species is 10
b1 <- log(c(2,1.1,1,1.1,2))*c(-1,-1,0,1,1) # slopes for each species (spp 1&2 decrease by factor of 2 & 1.1, spp3 no change, spp 4&5 increase by factor of 1.1 & 2)
beta <- cbind(b0, b1) # bind beta coefficients together
X <- cbind(rep(1,n), runif(n, min = 0, max = 1)) # fire severity measurements

# simulate spatial autocorrelation in the abundances
xycoords <- data.frame(x = runif(n, 152.1, 152.5), y = runif(n, 28.1, 28.5)*-1) # make spatial coordinates for survey locations
sigma.spatial <- c(2) # standard deviation of spatial covariance
alpha.spatial <- c(0.35) # spatial scale parameter 'alpha' for spatial covariance
sigma <- sigma.spatial^2*exp(-as.matrix(dist(xycoords))/alpha.spatial) # exponentially decreasing spatial covariance matrix
eta <- mvrnorm(mu = rep(0,n), Sigma = sigma) # draw spatially structured residuals from a multivariate normal distribution (spatially structured latent predictor)
beta.spatial <- c(1,2,-2,-1,0) # define how species are spatially auto-correlated (i.e., respond to spatially structured latent predictor (eta))
spatial_res <- eta %*% t(beta.spatial) # calculate spatially structured residuals for each species
lambda <- exp((X %*% t(beta))) + spatial_res # add spatial residuals to mean expected abundance of each species due to fire severity

# make final data frame of simulated species abundances with spatial autocorrelation
y <- data.frame(apply(data.frame(lambda), 2, function(x) rpois(n, x))) # simulate species abundance observations from the poisson distribution
colnames(y) <- c('spp1', 'spp2', 'spp3', 'spp4', 'spp5') # add column names to data frame
df <- data.frame(xycoords, y, Fire_severity = X[,2]) # add fire severity
head(df)
What do the simulated relationships look like? Let’s do some plots. First, it will be easiest to plot if we make this data tidy. This means that, instead of having a separate column for each species’ abundance observations (wide format data), we pivot to long format, whereby we have only one column for abundance observations (so every unique observation of abundance has its own row).
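One way to do the pivot is with {tidyr}'s pivot_longer(); a sketch, with column names following the simulated dataframe above:

```r
library(tidyr)

# wide to long: one row per site-by-species abundance observation
df_long <- pivot_longer(df, cols = spp1:spp5,
                        names_to = 'spp', values_to = 'abundance')
head(df_long)
```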
Great, now we can more easily plot a trend line for each species’ abundance in response to fire severity.
ggplot(df_long) +
  aes(x = Fire_severity, y = abundance, col = spp) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm') +
  ylab('Species abundance') +
  xlab('Fire Severity') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'
What trends did we expect to see? We can take the exponent of the slope (\(b_1\) coefficients) to get the expected multiplicative change in species abundance for every one unit increase in fire severity. If we multiply this by the y-intercept (mean abundance at fire severity = 0), that gives us the abundance we expect when fire severity = 1.
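Using the coefficients from the simulation above:

```r
# known truths from the simulation
b0 <- log(rep(10, 5)) # intercept: mean abundance of 10 for every species at fire severity = 0
b1 <- log(c(2, 1.1, 1, 1.1, 2)) * c(-1, -1, 0, 1, 1) # slopes

exp(b1)           # multiplicative change in abundance per unit increase in fire severity
exp(b0) * exp(b1) # expected abundance at fire severity = 1 (5, ~9.1, 10, 11 and 20)
```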
Pretty close for most species, although some deviations from expected (especially for species 3 - we would have expected a flat line around 10). The deviations will stem from the spatial autocorrelation that we simulated in the data. But don’t worry about it for now, we’ll discuss that in more detail later.
Spatial data in R
Let’s turn our species abundance dataframe into a spatial one and map it. This is easy in R.
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
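A minimal sketch with st_as_sf(), assuming WGS84 (EPSG:4326) since the simulated coordinates look like decimal degrees, followed by a quick interactive map with {tmap}'s qtm() (tmap v3 syntax):

```r
library(tmap)

# convert the dataframe to an sf object using its x/y coordinate columns
df.sf <- st_as_sf(df, coords = c('x', 'y'), crs = 4326)

# interactive map: species 1 abundance over fire severity
tmap_mode('view')
qtm(df.sf, dots.col = 'spp1')
```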
Tip Use the layered diamonds in the top left of each map to turn the abundance layer off and see fire severity below.
An exploratory, algorithmic ordination
Before we try model-based ordination of our multivariate species abundance data, let’s try a quick exploratory nMDS to look for patterns.
library(vegan)

nmds <- metaMDS(df[,3:7], distance = 'bray', k = 2)
nmds
Call:
metaMDS(comm = df[, 3:7], distance = "bray", k = 2)
global Multidimensional Scaling using monoMDS
Data: wisconsin(df[, 3:7])
Distance: bray
Dimensions: 2
Stress: 0.1846236
Stress type 1, weak ties
Best solution was repeated 3 times in 20 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on 'wisconsin(df[, 3:7])'
The stress is an estimate of the goodness of fit of the nMDS. It is less than 0.2, indicating a decent fit.
Let’s plot the ordination and see which survey locations are similar based on species abundance and composition. First we need to extract the site scores and species scores for the two nMDS dimensions.
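vegan's scores() pulls out both sets of scores. Note the Fire_class grouping used in the PERMANOVA below isn't defined in the rendered notes, so the 0.5 split here is a hypothetical reconstruction:

```r
# site and species scores on the two nMDS dimensions
site_scores <- as.data.frame(scores(nmds, display = 'sites'))
spp_scores  <- as.data.frame(scores(nmds, display = 'species'))

# bind site scores to the original data and add a (hypothetical) binary fire class
nmds_scores <- data.frame(site_scores, df)
nmds_scores$Fire_class <- ifelse(nmds_scores$Fire_severity > 0.5, 'High', 'Low')
```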
So it shows the patterns we expected. We could do a permutational analysis of variance (PERMANOVA) to formally test whether species abundance and composition differ between fire classes.
adonis2(nmds_scores[,3:7] ~ Fire_class, data = nmds_scores)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = nmds_scores[, 3:7] ~ Fire_class, data = nmds_scores)
Df SumOfSqs R2 F Pr(>F)
Fire_class 1 0.2767 0.12578 14.1 0.001 ***
Residual 98 1.9231 0.87422
Total 99 2.1998 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, the inference we can make regarding the sizes of effects for individual species, and the strength of the associations between species after accounting for fire severity, is limited.
So let’s try model-based ordination with a multivariate GLM instead and see if we gain any extra inference.
Note that there are many other approaches to algorithmic ordination than we’ve covered here. nMDS is a type of ‘unconstrained’ ordination, but there are other approaches for ‘constrained’ ordination which are a closer analogue to the model-based ordination we’ll try next.
Find out more about different types of ordination and how to perform them in R here.
Model-based ordination
As before, we’ll set up our model and fit it with MCMC sampling. The main difference is this time we include a site-level random effect that models the residual correlations across species as latent (unobserved) variables.
# set up random effect
studyDesign <- data.frame(sample = as.factor(1:nrow(df)))
rL <- HmscRandomLevel(units = studyDesign$sample)

# set up the model
m <- Hmsc(Y = as.matrix(df[,3:7]),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson',
          studyDesign = studyDesign,
          ranLevels = list(sample = rL))

# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning_mult.rds')
m_fit <- readRDS('model_100thinning_mult.rds')

mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Okay, so convergence diagnostics look good. Let’s take a look at the structural model assumptions. We’ll use randomised quantile residuals, and this time we’ll set up a for loop that will ‘loop’ through each of our 5 species and plot them separately.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = c(1,2)) # get the mean predicted species abundance across the 4000 MCMC samples
par(mfrow = c(5,2)) # set graphical parameters
qresids.df <- data.frame(Fitted.mean) # dataframe for storing quantile residuals for each species
for(i in 1:ncol(Fitted.mean)){
  Fitted.mean.spp <- Fitted.mean[,i]
  y_spp <- m_fit$Y[,i]
  a <- ppois(y_spp - 1, Fitted.mean.spp)
  b <- ppois(y_spp, Fitted.mean.spp)
  qresids.df[,i] <- qnorm(runif(n = length(y_spp), min = a, max = b))
  plot(Fitted.mean.spp, qresids.df[,i], main = colnames(Fitted.mean)[i]); abline(a = 0, b = 0) # plot the fitted vs. residuals
  qqnorm(qresids.df[,i], main = colnames(Fitted.mean)[i]); qqline(qresids.df[,i]) # qq plot
}
Species 1 looks okay, but there are some strong patterns in the residuals vs. fitted plots for species 2-5. What’s going on here?
Let’s also check whether our coefficient estimates are biased from the known truths. Let’s remind ourselves what those were for each species.
So there is definitely some bias, although the signs of the effects are generally correct (i.e. negative vs. positive).
Let’s see if we can improve our model by including a spatial random effect to account for spatial autocorrelation.
Accounting for spatial autocorrelation
What is spatial autocorrelation? It’s Tobler’s first law of geography, ‘Everything is related to everything else, but near things are more related than distant things’.
This means that observations of species abundance made closer together in space are more likely to be similar than observations made farther apart. In other words, our observations are not independent of each other, which violates an assumption of general(ised) linear models.
We can include a spatial random effect that will estimate and adjust for spatial autocorrelation in our species abundance observations. Let’s try it.
# set up spatial random effect
studyDesign <- data.frame(sample = as.factor(1:nrow(df)))
rL.spatial <- HmscRandomLevel(sData = df[,c(1,2)])
rL <- HmscRandomLevel(units = studyDesign$sample)

# set up the model
m <- Hmsc(Y = as.matrix(df[,3:7]),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson',
          studyDesign = studyDesign,
          ranLevels = list(sample = rL.spatial))

# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning_mult_spatial.rds')
m_fit <- readRDS('model_100thinning_mult_spatial.rds')

mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Now let’s see if our residual diagnostic plots look better.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = c(1,2)) # get the mean predicted species abundance across the 4000 MCMC samples
par(mfrow = c(5,2)) # set graphical parameters
qresids.df <- data.frame(Fitted.mean) # dataframe for storing quantile residuals for each species
for(i in 1:ncol(Fitted.mean)){
  Fitted.mean.spp <- Fitted.mean[,i]
  y_spp <- m_fit$Y[,i]
  a <- ppois(y_spp - 1, Fitted.mean.spp)
  b <- ppois(y_spp, Fitted.mean.spp)
  qresids.df[,i] <- qnorm(runif(n = length(y_spp), min = a, max = b))
  plot(Fitted.mean.spp, qresids.df[,i], main = colnames(Fitted.mean)[i]); abline(a = 0, b = 0) # plot the fitted vs. residuals
  qqnorm(qresids.df[,i], main = colnames(Fitted.mean)[i]); qqline(qresids.df[,i]) # qq plot
}
Much better!
We should also check for spatial autocorrelation in the residuals. We’ll do that visually by mapping the residuals.
Not perfect, but better than our model without spatial autocorrelation. In particular, the coefficient discrimination is much better, see here for a more detailed explanation.
So, what’s the advantage over algorithmic ordination approaches? With model-based ordination we can account for spatial autocorrelation and obtain more accurate and precise estimates of the effect of fire severity on the abundance of each species.
And we can plot the residual correlations between species (i.e., the correlations between species left over after we account for variability in abundances due to fire severity). These residual correlations could arise from species interacting with one another (e.g., predator-prey interactions), or from important explanatory variables that cause species to co-vary but that we haven’t included in the model.
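A common way to do this with {Hmsc} is computeAssociations(), which returns the posterior mean residual correlations and their posterior support; the {corrplot} visualisation is our own choice here, a sketch:

```r
library(corrplot)

OmegaCor <- computeAssociations(m_fit) # posterior residual correlations among species
supportLevel <- 0.95 # only show correlations with at least 95% posterior support

# zero out weakly supported correlations, then plot the rest
toPlot <- ((OmegaCor[[1]]$support > supportLevel) +
           (OmegaCor[[1]]$support < (1 - supportLevel)) > 0) * OmegaCor[[1]]$mean
corrplot(toPlot, method = 'color',
         col = colorRampPalette(c('blue', 'white', 'red'))(200))
```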
In this case we know that the residual correlations are the result of spatial autocorrelation in species’ abundances because we simulated them that way. In the real-world you wouldn’t know what is driving the residual correlations, but the patterns you see might allow you to make some hypotheses.
So, after accounting for variability in species abundance due to fire severity, we infer that abundances of species 1 are positively correlated with abundances of species 2, and negatively correlated with species 3 and 4. This is exactly what we expected, given how we simulated spatial autocorrelation in the data.
Now we can extract the posterior means of the \(\eta\) and \(\lambda\) parameters, which represent the site loadings (\(\eta\)) and species loadings (\(\lambda\)) on the spatial latent variables (random effects) estimated by our model. We can use these latent variables to plot an ordination. Note this will look different from the nMDS ordination we made earlier, as what we’ve done here is a constrained ordination. The nMDS is an unconstrained ordination.
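In {Hmsc} the posterior means can be pulled out with getPostEstimate(), and biPlot() draws the ordination; a sketch assuming our fitted spatial model m_fit (colouring by fire severity is an assumption):

```r
etaPost    <- getPostEstimate(m_fit, parName = 'Eta')    # posterior mean site loadings
lambdaPost <- getPostEstimate(m_fit, parName = 'Lambda') # posterior mean species loadings

# ordination on the first two latent variables
biPlot(m_fit, etaPost = etaPost, lambdaPost = lambdaPost, factors = c(1, 2))
```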
Now that we’ve removed variability in species abundances due to fire severity, we can see that site species composition and abundance are no longer associated with fire severity.
Predicting and mapping species abundance everywhere
Now that we are happy with our spatial, multivariate model of species abundance we might like to use it to make predictions of expected species abundance with fire severity everywhere in our survey area. In this sense you can think of our model as a joint species distribution model.
First we will generate a raster of expected fire severity by interpolating between our observations of fire severity. As a reminder, we simulated fire severity at our survey locations and it looks like this.
Let’s start by making a grid covering our survey area, which we can then use for interpolation. A spatial grid is what we often refer to as ‘raster’ data (the points above are referred to as ‘vector’ data). {terra} is a good R package for working with spatial raster data.
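Here's one way to build such a grid with {terra}; the grid dimensions and the rasterisation step are assumptions (the original grid object may have been made differently):

```r
library(terra)

# empty 50 x 50 raster grid spanning the extent of our survey coordinates
grid <- rast(xmin = min(df$x), xmax = max(df$x),
             ymin = min(df$y), ymax = max(df$y),
             ncols = 50, nrows = 50, crs = 'EPSG:4326')

# rasterise observed fire severity onto the grid (cells without a survey point stay NA)
grid_fire <- rasterize(vect(df.sf), grid, field = 'Fire_severity')
```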
Now do the interpolation using thin plate splines to predict fire severity everywhere.
library(fields)
xy <- data.frame(xyFromCell(grid_fire, 1:ncell(grid_fire))) # make a dataframe of longitude and latitude coordinates for each grid cell
tps <- Tps(xy, values(grid_fire)) # fit a thin plate spline regression
Warning:
Grid searches over lambda (nugget and sill variances) with minima at the endpoints:
(GCV) Generalized Cross-Validation
minimum at right endpoint lambda = 483.0456 (eff. df= 3.001006 )
grid_fire_i <- interpolate(rast(grid_fire), tps) # make a raster with interpolated values from the fitted tps regression
qtm(grid_fire_i)
Now we can make predictions of species abundance using the grid of interpolated fire severity. First, we need to set up our new data for making predictions by extracting x and y coordinates and fire severity values from our interpolated raster.
Finally, we’re ready to make predictions. We’ll generate an entire posterior predictive distribution (i.e., predictions for each sample of our posterior distribution of parameters), which we’ll summarise as both the mean predicted abundance (preds_abund) and the mean predicted probability of occurrence (preds_occur).
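With {Hmsc}, spatial predictions go through prepareGradient() and predict(). The occurrence calculation below (Poisson probability of at least one individual, 1 - exp(-lambda)) is our own assumption for how preds_occur might be derived, a sketch:

```r
# new data at every grid cell: coordinates plus interpolated fire severity
xy.grid <- data.frame(xyFromCell(grid_fire_i, 1:ncell(grid_fire_i)))
XData.grid <- data.frame(Fire_severity = values(grid_fire_i)[, 1])

# posterior predictive distribution at the new locations
Gradient <- prepareGradient(m_fit, XDataNew = XData.grid,
                            sDataNew = list(sample = xy.grid))
preds <- predict(m_fit, Gradient = Gradient, expected = TRUE, predictEtaMean = TRUE)

# summarise the posterior: mean expected abundance, and occurrence as 1 - P(count = 0)
preds_abund <- Reduce('+', preds) / length(preds)
preds_occur <- 1 - exp(-preds_abund)
```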
Use the layered diamonds in the top left corner to turn the predicted abundance for species 1 off and see predicted occurrence instead. Now let’s try layering up and comparing predicted abundance and distributions for all species. We’ll use a for loop to create a raster stack of predicted abundance for each species.
stack <- list() # empty list for storing grids for each species' predictions
for(i in seq_len(ncol(preds_abund))){
  stack[[i]] <- rast(grid, vals = preds_abund[,i])
}
stack <- do.call(c, stack) # turn list into a raster stack
names(stack) <- colnames(preds_abund)
stack
There we have it, we’ve mapped out predictions of abundance from our joint species distribution model! (Again, make sure you use the layered diamonds in the top left to turn layers on and off.)
Saving static maps
Here’s a quick example of how we can turn our interactive maps into something that is publication ready by having more control over the aesthetics than is possible with the qtm function we’ve been using previously.
tmap_mode('plot')
tmap mode set to plotting
map <- tm_shape(stack) + # Specify the shapefile or raster to map
  tm_raster(title = 'Species abundance') + # What is being displayed? Here, it's a raster
  tm_shape(df.sf) + # Add another shapefile to our map
  tm_dots( # Now we are displaying points (dots)
    title = 'Fire severity', # Title as it appears in the legend
    col = 'Fire_severity', # Colour the points by variable
    pal = 'BuGn', # Define the colour palette
    size = 0.05,
    scale = 2) +
  tm_facets(as.layers = T) +
  tm_layout(panel.labels = c('Sp.1', 'Sp.2', 'Sp.3', 'Sp.4', 'Sp.5'), # Define labels
            legend.position = c(-0.5, -0.05), # Customise legend
            legend.title.fontface = 'bold') +
  tm_compass(position = c(0.001, 0.82)) + # Add a north arrow
  tm_scale_bar(position = c(0.65, 0.015), # Add a scale bar
               breaks = c(0, 5, 10),
               text.size = 0.4)

map
To save it:
#tmap_save(map, 'map.png')
Resources
Hopefully this short course gave you a sense of how flexible Bayesian statistical approaches can be. For example, we used the same Bayesian modelling framework to:
fit a linear model,
fit a generalised linear model,
fit a multivariate generalised linear model (which is also model-based ordination), and
fit a spatial, multivariate generalised linear model (which is also a joint species distribution model).
Phew, that is a lot!
Find out more about fitting Bayesian models and using R as a GIS here:
My absolute favourite intro to Bayesian thinking, a really great read: Statistical Rethinking